[Directory] Strengthen single activation guarantees even further. #9958
ledjon-behluli wants to merge 8 commits into dotnet:main from
Conversation
@ReubenBond addressed the feedback
…eption.cs Co-authored-by: Reuben Bond <203839+ReubenBond@users.noreply.github.com>
Force-pushed 2d3d58a to 1840151
Pull request overview
This PR strengthens the strongly-consistent distributed grain directory’s single-activation guarantees by introducing “safety lease holds” after ungraceful silo failures and during recovery scenarios, delaying re-registrations to reduce split-brain risk.
Changes:
- Add silo-level and range-level lease hold tracking to directory partitions, including periodic cleanup of expired leases.
- Enforce lease holds during RegisterAsync by rejecting registrations with a new DirectoryLeaseHoldException.
- Add configuration (SafetyLeaseHoldDuration) and a new test validating reactivation blocking after an ungraceful shutdown.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| test/Orleans.GrainDirectory.Tests/Orleans.GrainDirectory.Tests.csproj | Adds Microsoft.Extensions.TimeProvider.Testing package for FakeTimeProvider in tests. |
| test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs | New test exercising lease-hold behavior after KillSiloAsync. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.cs | Tracks/creates lease holds, applies holds on silo death, and adds cleanup logic + logging. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.Interface.cs | Enforces lease holds on registration by throwing DirectoryLeaseHoldException. |
| src/Orleans.Runtime/GrainDirectory/DistributedGrainDirectory.cs | Computes effective lease duration, wires TimeProvider, and runs periodic lease cleanup requests. |
| src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs | Adds SafetyLeaseHoldDuration option and documentation; switches to file-scoped namespace. |
| src/Orleans.Core.Abstractions/Exceptions/DirectoryLeaseHoldException.cs | Introduces a new exception type for lease-hold rejections. |
Comments suppressed due to low confidence (2)
src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs:88
- Typo in XML doc comment: "clcks skues" should be "clock skews".
/// Conditional deregistration is used for lazy clean-up of activations whose prompt deregistration failed for some reason (e.g., message failure).
/// This should always be at least one minute, since we compare the times on the directory partition, so message delays and clcks skues have
/// to be allowed.
test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs:65
- Grammar in comment: "its the only one alive" should be "it's the only one alive".
// Time has expired, we can place it now, and it will be the primary as its the only one alive.
Assert.Equal(await leaseGrain.GetAddress(), primary.SiloAddress);
```csharp
// The value doesnt matter we will advance time manually.
public static readonly TimeSpan LeaseHoldDuration = TimeSpan.FromSeconds(5);
```
Spelling in comment: "doesnt" should be "doesn't".
This issue also appears on line 64 of the same file.
```csharp
{
    if (_leaseHoldDuration > TimeSpan.Zero)
    {
        // We pessimistically asssume if snapshot transfer failed, than safety is needed.
```
Spelling/grammar in comments: "asssume" -> "assume" and "than" -> "then".
```diff
- // We pessimistically asssume if snapshot transfer failed, than safety is needed.
+ // We pessimistically assume if snapshot transfer failed, then safety is needed.
```
```csharp
if (lockedRange.Contains(rangeHash))
{
    // We reject, the client should retry!
    throw new DirectoryLeaseHoldException($"Range {lockedRange} is under a lease hold until {expiration - utcNow}.");
}
```
The exception messages use "until {expiration - utcNow}" but {expiration - utcNow} is a remaining duration, not an absolute time. Consider changing the wording to "for {remaining}" or include the absolute expiration timestamp to avoid confusion when diagnosing retries.
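One way to apply this suggestion, keeping the variables from the quoted snippet (`lockedRange`, `expiration`, `utcNow`); the exact wording is illustrative, not the merged change:

```csharp
// Report both the remaining duration and the absolute expiration time,
// so retry logs are unambiguous when diagnosing rejected registrations.
throw new DirectoryLeaseHoldException(
    $"Range {lockedRange} is under a lease hold for another {expiration - utcNow} (expires at {expiration:O}).");
```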
```csharp
{
    public class GrainDirectoryOptions
    /// <summary>
    /// Configuration type that controls the type of the grain directory caching algorithm that silo use.
```
Typo in XML doc comment: "silo use" should be "silos use".
This issue also appears on line 86 of the same file.
```diff
- /// Configuration type that controls the type of the grain directory caching algorithm that silo use.
+ /// Configuration type that controls the type of the grain directory caching algorithm that silos use.
```
```csharp
if (_leaseHoldDuration > TimeSpan.Zero)
{
    // We pessimistically asssume if snapshot transfer failed, than safety is needed.
    var expiration = _timeProvider.GetUtcNow().UtcDateTime.Add(_leaseHoldDuration);
    _rangeLeaseHolds.Add((addedRange, expiration));
    LogWarningLeaseHoldForRange(_logger, addedRange, expiration);
}
```
The new range-lease-hold behavior (adding _rangeLeaseHolds on snapshot transfer failure) doesn’t appear to have targeted test coverage. Consider adding tests which (a) force a snapshot transfer failure and assert registrations in that range throw DirectoryLeaseHoldException until expiry, and (b) verify a graceful shutdown path does not apply lease holds.
```csharp
foreach (var kvp in _directory)
{
    if (expiredSilos.Contains(kvp.Value.SiloAddress!))
    {
        if (_directory.Remove(kvp.Key))
        {
```
CleanupExpiredLeasesCore iterates _directory using foreach while calling _directory.Remove(...) inside the loop. Removing entries from a Dictionary during enumeration will throw InvalidOperationException, so lease cleanup will repeatedly fail (and only be caught/logged). Collect the keys to remove first (or use ToArray()/List<GrainId>), then remove them in a separate loop; also consider using a HashSet<SiloAddress> for expiredSilos to avoid repeated Contains scans.
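A minimal sketch of that restructuring, assuming the Orleans types from the quoted snippet (`_directory` as a `Dictionary<GrainId, ...>`, `SiloAddress`) and illustrative local names; collect first, remove second:

```csharp
// Snapshot the keys to remove before mutating, so the dictionary is never
// modified while it is being enumerated (which would throw
// InvalidOperationException and make every cleanup pass fail).
var expiredSiloSet = new HashSet<SiloAddress>(expiredSilos); // O(1) membership checks
var keysToRemove = new List<GrainId>();
foreach (var kvp in _directory)
{
    if (kvp.Value.SiloAddress is { } silo && expiredSiloSet.Contains(silo))
    {
        keysToRemove.Add(kvp.Key);
    }
}

// Second pass: safe to mutate now that enumeration is finished.
foreach (var key in keysToRemove)
{
    _directory.Remove(key);
}
```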
```csharp
// If it was ShuttingDown, it surrendered its ownership gracefully.
// If it was Active (or Joining) and suddenly became Dead, it crashed.
if (previousStatus is not SiloStatus.ShuttingDown && _leaseHoldDuration > TimeSpan.Zero)
```
Graceful shutdowns typically transition ShuttingDown -> Stopping -> Dead. This check only treats previousStatus == ShuttingDown as graceful, so a gracefully-stopping silo will likely have previousStatus == Stopping when the Dead update arrives and will incorrectly get a safety lease hold applied. Consider treating Stopping (and possibly any previousStatus.IsTerminating()) as graceful to match the intent in the comment/PR description.
```diff
- // If it was ShuttingDown, it surrendered its ownership gracefully.
- // If it was Active (or Joining) and suddenly became Dead, it crashed.
- if (previousStatus is not SiloStatus.ShuttingDown && _leaseHoldDuration > TimeSpan.Zero)
+ // If it was ShuttingDown or Stopping, it surrendered its ownership gracefully.
+ // If it was Active (or Joining) and suddenly became Dead, it crashed.
+ var isGracefulShutdown = previousStatus is SiloStatus.ShuttingDown or SiloStatus.Stopping;
+ if (!isGracefulShutdown && _leaseHoldDuration > TimeSpan.Zero)
```
```csharp
_leaseHoldDuration = directoryOptions.Value.SafetyLeaseHoldDuration switch
{
    null => 2 * membershipOptions.Value.ProbeTimeout * membershipOptions.Value.NumMissedProbesLimit,
    TimeSpan duration when duration >= TimeSpan.Zero => duration,
    _ => throw new InvalidOperationException("Lease hold duration must be non-negative.")
};
```
The PR description says lease duration is computed when the configured duration is <= 0, but the implementation here computes the duration only when SafetyLeaseHoldDuration is null and treats TimeSpan.Zero as an explicit "disable leases" value (and throws for negatives). Either update the PR description (and/or option docs) to match this behavior, or adjust the implementation to match the described <= 0 semantics.
ReubenBond left a comment:
We should add some more tests. I'll ask copilot to add some more coverage.
…lookup, and multi-grain scenarios Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a safety lease hold is active (e.g., after an ungraceful silo crash or during cluster startup with non-contiguous view changes), the partition returns DirectoryResult.RetryAfter(remainingDuration) instead of throwing. InvokeAsync checks RetryAfterDelay and waits the suggested duration before retrying, keeping the retry inside the directory layer. Previously, throwing DirectoryLeaseHoldException would propagate to the activation layer, which deactivates with DirectoryFailure and forwards the message to the same silo. Because DirectoryFailure is treated as a transient error, the message ForwardCount is decremented on each cycle, preventing the hop limit from being reached and creating an infinite zero-delay forwarding loop that hangs the system. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return RetryAfter from RegisterAsync instead of throwing, so InvokeAsync waits the exact remaining lease duration using the injected TimeProvider. This removes the DirectoryLeaseHoldException class entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
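The retry flow described in these commit messages can be sketched as follows. The types and member names here (`DirectoryResult`, `RetryAfterDelay`, the loop shape) are illustrative stand-ins for the actual Orleans internals, not the merged implementation:

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for the directory's result type: a null RetryAfterDelay means success.
record DirectoryResult(TimeSpan? RetryAfterDelay);

static class DirectoryInvoker
{
    // Keep the retry inside the directory layer: when a lease hold is active,
    // wait the suggested remaining duration and retry, instead of throwing an
    // exception that the activation layer would turn into a zero-delay
    // forwarding loop.
    public static async Task<DirectoryResult> InvokeAsync(Func<Task<DirectoryResult>> call)
    {
        while (true)
        {
            var result = await call();
            if (result.RetryAfterDelay is { } delay && delay > TimeSpan.Zero)
            {
                await Task.Delay(delay); // wait out the remaining lease hold
                continue;
            }
            return result; // no hold in effect: done
        }
    }
}
```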
@ledjon-behluli I made a minor change to not throw when a lease is still held. Now, the directory returns a 'retry after delay' result instead.
This PR adds safety lease holds to the new strongly-consistent grain directory to prevent split-brain scenarios after silo crashes, utilizing both specific grain leases (tombstones) and broader range leases when snapshot transfers fail. Under normal graceful shutdowns the new directory performs handovers, so no range lease holds are applied.
Users can specify the lease duration; otherwise, if the duration is <= zero, it is computed as:
2 × ProbeTimeout × NumMissedProbesLimit
which should provide a good buffer so that a network-partitioned silo has enough time to realize it's isolated and shut itself down before the rest of the cluster allows those grains to reactivate elsewhere. The impact on normal registrations in a healthy cluster should be practically unobservable.
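As a worked example of that formula, with illustrative membership values (these are assumptions, not necessarily your cluster's configured `ClusterMembershipOptions`):

```csharp
using System;

// Assumed values for illustration only; substitute the actual
// ProbeTimeout and NumMissedProbesLimit from your cluster configuration.
var probeTimeout = TimeSpan.FromSeconds(5);
var numMissedProbesLimit = 3;

// 2 × ProbeTimeout × NumMissedProbesLimit = 2 × 5s × 3 = 30 seconds
var leaseHold = 2 * probeTimeout * numMissedProbesLimit;
Console.WriteLine(leaseHold); // 00:00:30
```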
Note, this won't work with the eventually-consistent directory!